ABSTRACT


In order to learn complex grammars, recurrent neural networks (RNNs) require sufficient computational
resources to ensure correct grammar recognition. A widely-used approach to expand model
capacity is to couple an RNN to an external memory stack. Here, we introduce a “neural
state” pushdown automaton (NSPDA), which consists of a digital stack, instead of an analog one, that 
is coupled to a neural network state machine. We empirically show its effectiveness in recognizing 
various context-free grammars (CFGs). First, we develop the underlying mechanics of the proposed 
higher order recurrent network and its manipulation of a stack as well as how to stably program its 
underlying pushdown automaton (PDA) to achieve desired finite-state network dynamics. Next, we 
introduce a noise regularization scheme for higher-order (tensor) networks, to our knowledge the 
first of its kind, and design an algorithm for improved incremental learning. Finally, we design a 
method for inserting grammar rules into an NSPDA and empirically show that this prior knowledge
improves its training convergence time by an order of magnitude and, in some cases, leads to better 
generalization. The NSPDA is also compared to a classical analog stack neural network pushdown 
automaton (NNPDA) as well as a wide array of first and second-order RNNs with and without external 
memory, trained using different learning algorithms. Our results show that, for Dyck(2) languages, 
prior rule-based knowledge is critical for optimization convergence and for ensuring generalization to 
longer sequences at test time. We observe that many RNNs with and without memory, but no prior 
knowledge, fail to converge and generalize poorly on CFGs. 


Introduction 


Despite their success, artificial neural networks (ANNs), especially recurrent neural networks (RNNs), have repeatedly 
been shown to struggle with generalizing in a sophisticated, systematic manner, often uncovering misleading statistical 
associations instead of true causal relations. Verifying what is learned by these black-box models remains an open
challenge, centering around one central issue — the lack of interpretability and modularity. The fact that successful ANN 
optimization depends heavily on large quantities of data only serves to further worsen the problem. 


One research direction towards developing more interpretable ANNs focuses on rule extraction from and assimilation 
of rules into RNNs [1] [2]. To solve difficult grammatical inference problems, various types of specialized RNNs have
been designed [3] [4] [5] [6]. However, it has been shown that RNNs augmented with external memory structures,
such as the neural network pushdown automaton (NNPDA), are more powerful than RNNs without, both historically 
[9] and recently, using differentiable memory [12] [13] [14] [15] [11] [16] [17] [18]. Yet most of these models
lack interpretability and how they learn any given grammar is still debatable. In the past, rule integration methods 
have been proposed to tackle the interpretability issue [9] [19] and offer a promising path towards the design of ANNs 
with an underlying knowledge structure that is a bit more understandable and transparent. However, to the best of our
knowledge, there exists no method for inserting rules into the states of the far more powerful class of higher order, 
memory-augmented RNNs. 


Working towards interpretable, memory-based neural models, we make the following contributions:


• We propose the neural state pushdown automaton and its incremental training method, which exploits the
concept of iterative refinement.

• We develop a novel regularization method that empirically yields better generalization in complex, memory-based
RNNs. To our knowledge, we are the first to propose a weight regularizer that works with higher-order
RNNs.

• We propose a method for programming states into a neural state machine with binary second and third-order
weights.

• We develop a method for inserting rules into stack-based recurrent networks.

• We compare our model with the NNPDA and other RNNs, trained using different learning algorithms.


Motivation & Related Work 


Research related to integrating knowledge into ANNs has existed for quite some time, such as through the design of 
state machines [20] [19]. Recent efforts in the domain of natural language processing have shown the effectiveness of
using state machines for tasks such as visual question answering, which allow an agent to directly use higher-level 
semantic concepts to represent visual and linguistic modalities [21]. With respect to rule-insertion itself, there exists a 
great deal of work showcasing its effectiveness when used with ANNs [22] as well as with RNNs [9] [19]. Notably, [19]
showed how deterministic finite automaton rules could be encoded into second order RNNs. 


One important, classical model that we draw inspiration from is the neural network pushdown automaton (NNPDA) 
[23]. The structure of our proposed model is similar to the NNPDA, but, as we will discuss, the major difference is that 
the model works with a digital stack as opposed to an analog one. Interestingly enough, prior work has also shown how 
to insert “hints” into the NNPDA, where knowledge of “dead states” can be used to guide its learning process [23]. In the
spirit of this hint-based methodology, we will develop a method for encoding useful rules related to target CFGs into 
our neural state pushdown automaton (NSPDA). This, to our knowledge, is the first approach of its kind, since no rule 
methodology has been previously proposed for complex state-based models. Creating such a procedure allows us
both to exploit the far greater representational capabilities of memory-augmented RNNs and to offer an intuitive way
of understanding the knowledge contained in and acquired by RNNs.


In this work, we will focus on RNNs that control a discrete stack, particularly our proposed NSPDA. We will empirically 
determine if the inductive biases we encode into its synaptic weights speed up the parameter optimization process 
and, furthermore, improve model generalization over longer sequences at test time. In addition, the results of our
experiments, which compare a wide variety of RNNs (of varying order, with and without memory), strongly
contradict the claim presented in recent work [24] that first order RNNs, like the popular
gated recurrent unit RNN [25], are as powerful as a PDA. In essence, our work demonstrates that for an RNN to
recognize a complex CFG, it will, at least, require external memory. Our results also demonstrate the value of encoding 
even partial PDA information, which positively impacts convergence time and model generalization.


The Neural State Pushdown Automaton 


Neural Architecture 


The model we propose, the NSPDA with iterative refinement, is shown in Figure 1. The NSPDA consists of fully
connected recurrent neurons which we will label as state neurons, primarily to distinguish them from the neurons 
that function as output neurons. Introducing the concept of state neurons is important when considering the notion 
of higher-order networks, i.e., second or third order RNNs, which allows us to map state representations directly to 
outputs. In this model, at each time step t, the state neuron receives signals from the input neurons, its previous state, 
and the stack-read neurons. The input neurons process a string, one character at a time, while non-recurrent neurons, 
also labeled as “action neurons”, represent an operation to be performed on a stack data structure, i.e., Push/Pop/No-op.
The action neurons are also designated as the controller which can either be recurrent or linear (recurrent controllers 
usually perform better in practice, so we focus on these in this paper). Furthermore, “read” neurons are used to keep 
track of the symbols present at the top of the stack. 


To make the above high-level description concrete, consider a single hidden-layer NSPDA. A full symbol sequence
sample $(y, X)$ is defined by $X = \{x_1, x_2, \cdots, x_T\}$, where the binary label $y$ indicates whether the sequence is valid
(1) or not (0). When processing a (binary) symbol/token $x_t \in \{0, 1\}^{L \times 1}$ at the discrete time step $t$, the NSPDA
computes a new state variable vector $z_t \in \mathbb{R}^{J \times 1}$, where $L$ is the total number of input/sensory neurons
(or dimensionality of the input space, sometimes classically referred to as alphabet size) and $J$ is the total number of
state neurons. The action neuron vector is defined as $a \in \mathbb{R}^{L \times 1}$ and the read neuron vector is defined as $r \in \mathbb{R}^{L \times 1}$, i.e.,
the action and read spaces have the same dimensionality as the input, or $|x| = |r| = |a|$. Taken together, the above sets
of input, state, and read neurons represent a full NSPDA model with parameters $\Theta = \{W^s, W^a, W^o\}$. Crucially, $W^s$
and $W^a$ are both 4-dimensional (4D) synaptic weight tensors, i.e., the binary “to-state” tensor $W^s \in \{0, 1\}^{J \times L \times L \times J}$
and the 4D ternary to-action tensor $W^a \in \{-1, 0, 1\}^{J \times L \times L \times J}$ (note that $-1$ is “pop”, $0$ is “no-op”, and $1$ is “push”).
At $t$, inference (for a third order NSPDA) is conducted as follows:


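$$z_i^{t+1} = g\Big(\sum_{j,k,l} W^s_{ijkl}\,(z_j^t\, r_k^t\, x_l^t) + b_i^s\Big) \tag{1}$$
$$a_i^{t+1} = f\Big(\sum_{j,k,l} W^a_{ijkl}\,(z_j^t\, r_k^t\, x_l^t) + b_i^a\Big) \tag{2}$$
$$r^{t+1} = \begin{cases} \alpha_1 & \text{if } a_t = 0 \\ \alpha_2 & \text{if } a_t = 1 \\ \alpha_3 & \text{if } a_t = -1 \end{cases} \tag{3}$$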

where $\alpha_1 \sim U(0.0001, 0.008)$, $\alpha_2 \sim U(0.901, 0.992)$, and $\alpha_3 \sim U(0.025, 0.110)$ are threshold values that determine
what the next state of the discrete read unit $r_t$ will be (sampled uniformly from a special interval to create a continuous
value for backprop to work with). Note that $z_{t+1}$ is the next hidden state, $a_t$ is the next stack action, and $r_{t+1}$ is the
next value of the neuron that reads the content at the top of the stack. $g(v)$ and $f(v)$ are non-linear activation functions,
specifically, quantized sigmoidal functions, defined as:


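$$g(v) = \frac{1}{1 + e^{-v}}, \qquad f(v) = 2g(v) - 1 \tag{4}$$
$$\hat{g}(v) = \ldots \tag{5}$$
$$\hat{f}(v) = \begin{cases} 1 & \text{if } f(v) > 0.13 \\ 0 & \text{if } -0.09 < f(v) < 0.13 \\ -1 & \text{otherwise.} \end{cases} \tag{6}$$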


As the NSPDA processes a string, a prediction $\hat{y}_t$ of its validity is made at each step. Specifically, the output weights
$W^o \in \mathbb{R}^{J \times 1}$ (and bias scalar $b^o$) are used to map the state vector $z_t$ to the output space. The output model is defined as
$\hat{y}_t = \sigma(W^o \cdot z_t + b^o)$, where $\sigma(v)$ is the logistic link function.
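As a rough illustration of the third order update in Equations 1 and 2, the following pure-Python sketch contracts a 4D weight tensor with the state, read, and input vectors. The index layout `W[i][j][k][l]` follows the subscripts of Equation 1; the helper names (`zeros4`, `third_order_step`), the 0.5 cut-off used for the binary state activation, and the handling of the boundary value 0.13 are assumptions made for this example and are not taken from the model definition.

```python
import math


def sigmoid(v):
    """Logistic function g(v) from Equation 4."""
    return 1.0 / (1.0 + math.exp(-v))


def quantize_state(v, threshold=0.5):
    """Binary state activation. ASSUMPTION: the 0.5 cut-off is illustrative only."""
    return 1 if sigmoid(v) > threshold else 0


def quantize_action(v):
    """Ternary action activation following Equation 6 (1=push, 0=no-op, -1=pop)."""
    f = 2.0 * sigmoid(v) - 1.0
    if f > 0.13:
        return 1
    if f > -0.09:  # ASSUMPTION: the boundary value 0.13 itself is treated as no-op
        return 0
    return -1


def zeros4(n_out, n_state, n_read, n_in):
    """Hypothetical helper: nested-list tensor indexed as W[i][j][k][l]."""
    return [[[[0.0] * n_in for _ in range(n_read)]
             for _ in range(n_state)] for _ in range(n_out)]


def third_order_step(W, b, z, r, x, activation):
    """out_i = activation(sum_{j,k,l} W[i][j][k][l] * z[j] * r[k] * x[l] + b[i])."""
    out = []
    for i in range(len(W)):
        total = b[i]
        for j, z_j in enumerate(z):
            for k, r_k in enumerate(r):
                for l, x_l in enumerate(x):
                    total += W[i][j][k][l] * z_j * r_k * x_l
        out.append(activation(total))
    return out


# Toy usage: 2 state neurons, 2-symbol alphabet, one strong connection 0 -> 1.
W_s = zeros4(2, 2, 2, 2)
W_s[1][0][0][0] = 10.0
next_state = third_order_step(W_s, [-5.0, -5.0], [1, 0], [1, 0], [1, 0], quantize_state)
```

In this toy setting only state neuron 1 receives enough drive to overcome its negative bias, so the quantized next state is one-hot.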


The actual external stack itself is manipulated by discrete-valued action neurons that trigger a discrete push or pop 
action (as given by Equation 2). Take, for example, a 2-letter alphabet, i.e., {a, b}. The dimensions of the action and
read spaces would then, in this case, be |a| = |r| = 2. When using a digital stack, the following actions can be taken: 


• PUSH: This means that the current input is pushed to the top of the stack. Example: To push the symbol “a”,
use $a_t = \langle 1, 0 \rangle$ and $r_t = \langle 0.955, 0.008 \rangle$.

• POP: This means that the element is removed from the top of the stack. Example: To remove the symbol “b”,
use $a_t = \langle 0, -1 \rangle$ and $r_t = \langle 0.008, 0.065 \rangle$.

• NO-OP: This simply means “no operation”, or, in other words, nothing is to be done with the stack. Example:
use $a_t = \langle 0, 0 \rangle$ and $r_t = \langle 0.008, 0.008 \rangle$.
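The three stack actions listed above can be sketched as a small discrete stack whose read vector is sampled from the intervals quoted for the three threshold values. The class name `DigitalStack`, the seeding, and the choice to ignore a pop on an empty stack are assumptions of this sketch, not details given in the text.

```python
import random

# Sampling intervals for the read unit, keyed by action (1=push, 0=no-op, -1=pop).
READ_INTERVALS = {0: (0.0001, 0.008), 1: (0.901, 0.992), -1: (0.025, 0.110)}


class DigitalStack:
    """Discrete stack driven by a ternary action vector (one entry per alphabet symbol)."""

    def __init__(self, alphabet, seed=0):
        self.alphabet = list(alphabet)
        self.items = []
        self.rng = random.Random(seed)

    def step(self, action):
        """Apply one action vector and return the sampled read vector."""
        for symbol, a in zip(self.alphabet, action):
            if a == 1:
                self.items.append(symbol)
            elif a == -1 and self.items:
                # ASSUMPTION: a pop removes whatever is on top; an empty stack is left as is.
                self.items.pop()
        return [self.rng.uniform(*READ_INTERVALS[a]) for a in action]


stack = DigitalStack("ab")
r_push = stack.step([1, 0])    # push "a"
after_push = list(stack.items)
r_pop = stack.step([-1, 0])    # pop it again
```

Each component of the returned read vector falls in the interval associated with the action taken on that component, which mirrors the example vectors given for PUSH, POP, and NO-OP.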


In the case of the vector $r_t$, we are reading the symbol currently located at the top of the stack at each time step
(the corresponding read vectors are shown above in the action vector examples). Our goal is to make sure the RNNs choose
the correct action during training and yet still maintain stable binary read states $z_t$.


Learning and Optimization 


First, we define the loss function used to both measure the performance of the network as well as optimize its parameters. 
Classically, state neural models such as the NNPDA exclusively made use of a binary loss function that only considered 
if a string was valid or invalid [26]. Furthermore, these models only made a prediction/classification at the very end of 
the sequence. In contrast, the NSPDA is an iterative, step-by-step predictive model. Thus, we consider using a sequence 
loss based on binary cross entropy.¹ The instantaneous loss, for a single sequence $(y, X)$, is:


$$\mathcal{L}(y, X, \Theta) = \sum_{t=1}^{T} -y \log(\hat{y}_t) - (1 - y) \log(1 - \hat{y}_t). \tag{7}$$


¹In preliminary experiments, models using a squared error loss, with and without regularization penalties, had great difficulty in
converging. We found using cross entropy was far more effective. 
Figure 1: The NSPDA shown making predictions over K = 2 steps of iterative refinement.


where $\hat{y}_t$ is the t-th prediction/output from the final state neuron. Note that $y$ is copied each step in time, which injects
an extra error signal throughout the sequence length, improving the optimization process (as opposed to relying on only 
a single output error signal to be effectively propagated backwards through the underlying computation graph). 
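The per-step sequence loss of Equation 7 can be sketched in a few lines. The function name `sequence_bce` and the clamping constant `eps` are additions for this example (the clamp only guards against taking the logarithm of zero).

```python
import math


def sequence_bce(y, predictions, eps=1e-12):
    """Sum of per-step binary cross-entropy terms against the same label y."""
    total = 0.0
    for p in predictions:
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); eps is an added safeguard
        total += -y * math.log(p) - (1 - y) * math.log(1.0 - p)
    return total


# The same three per-step predictions scored against a valid and an invalid label.
loss_valid = sequence_bce(1, [0.6, 0.8, 0.9])
loss_invalid = sequence_bce(0, [0.6, 0.8, 0.9])
```

Because the label is copied to every step, each prediction in the sequence contributes its own error term, rather than only the final one.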


To compute updates for the NSPDA’s parameters, we employed several gradient-based approaches, including the 
popular and common back-propagation through time (BPTT) procedure as well as online algorithms such as real-time 
recurrent learning (RTRL) and unbiased online recurrent optimization (UORO) [28]. In short, all of these algorithms 
compute gradients of the loss function (Equation 7) with respect to NSPDA weights. The primary difference between
the algorithms is that BPTT is based on a reverse-mode differentiation routine while RTRL is based on forward-mode
differentiation (and UORO is a faster, higher variance approximation of RTRL). In further detail, we describe UORO 
and RTRL in the appendix. While UORO and RTRL are not commonly used to train modern-day RNNs, they offer faster 
ways to train them without requiring graph unfolding. Thus, we compare the results of using each in our experiments. 


Iterative Refinement 


One important element we introduced into the training protocol of the NSPDA is that of iterative refinement, an 
algorithm proposed in the signal processing literature for incorporating partial iterative inference into a next-step 
predictive RNN [29]. At a high level, this means that, during training, at step $t$, the NSPDA is forced to predict the
same target ($y_t$) $K$ times (except for the state transitions that are provided as “hints”, which we will describe in a
later section). Crucially, the state vector is still carried over these $K$ steps, meaning the recurrent synapses relating
the state of the model at time $t$ to $t + 1$ are reused at each of them. For a next-step sequence model like the NSPDA,
iterative refinement can be cleanly introduced by manipulating the sequence loss of Equation 7 as follows:


$$D(\hat{y}, y) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}) \tag{8}$$
$$\mathcal{L}(y, X, S, \Theta) = \sum_{t} \sum_{k=1}^{K = S(t)} D(\hat{y}_{t,k}, y) \tag{9}$$


noting that we have introduced the variable $S$ to augment the sample $(y, X)$. $S$ is an integer sequence computed
as follows: $S = K(1 - H) + H$, where $H$ is a binary “hint” vector (automatically generated) of the form $H =
\{h_0, h_1, \cdots, h_T\}$ ($h_t = 1$ signals a hint is used, while $h_t = 0$ is “no hint”). Empirically, we found $K = 4$ worked
well. In [29], using an RNN’s recurrent weights as a lateral processing mechanism was related to an RNN acting
as a deep feedforward network with tied weights across $K$ hidden layers (a “prediction episode”). This means that
additional nonlinearity (via depth) is being efficiently exploited without incurring the memory cost of storing extra
weights. We found that iterative refinement introduces greater stability into the learning process, primarily when gradient
noise is used. Note that, even in this case, while we work with full precision weights for gradient computation, before
evaluation is conducted, the weights are converted to discrete values.
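To make the role of the hint vector concrete, the sketch below computes the refinement schedule $S = K(1 - H) + H$ and evaluates the loss of Equation 9 over it. The function names and the toy predictions are hypothetical; only the schedule formula and the double sum come from the text.

```python
import math


def refinement_schedule(hints, K=4):
    """S = K(1 - H) + H: K refinement steps without a hint, a single step with one."""
    return [K * (1 - h) + h for h in hints]


def bce(y, p):
    """Per-prediction binary cross entropy D(y_hat, y) from Equation 8."""
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


def refinement_loss(y, predictions, schedule):
    """Equation 9: sum D over every time step t and its S(t) refinement steps."""
    total = 0.0
    for step_preds, s_t in zip(predictions, schedule):
        if len(step_preds) != s_t:
            raise ValueError("expected one prediction per refinement step")
        total += sum(bce(y, p) for p in step_preds)
    return total


schedule = refinement_schedule([0, 1, 0], K=4)   # hint only at the middle step
preds = [[0.5] * s for s in schedule]            # placeholder predictions
loss = refinement_loss(1, preds, schedule)
```

With $K = 4$, a step carrying a hint is predicted once, while every other step is predicted four times and therefore contributes four loss terms.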


Two Stage Incremental learning 


Incremental learning, or, in other words, training procedures that sort data samples based on their inherent difficulty
and progressively present them to a neural agent, has been shown to be quite effective when training RNNs
on input data that is known to have some structure [26]. Based on this prior finding, we developed a two-stage
incremental learning approach for improving a higher-order RNN’s ability to generalize to longer sequences. Formally,
Algorithm 1 depicts the overall process. We found that using a stochastic learning rate worked better in the first
stage, while a fixed learning rate combined with a stochastic noise process applied to the weights (similar to gradient noise)
worked better during the second stage. As we will see later experimentally, whenever the data has some exploitable structure


Algorithm 1 Two Stage Incremental Learning
Input: $\Theta$ (model weights), training set $D$, validation set $V$, $N_L$ (midpoint length threshold), $\lambda$ (learning rate)

// Stage #1
$N_{max}$ = maxLen($D$)    ▷ Calculate longest string length
// Sequential Curriculum Update Phase
$D_i = \emptyset$
for $N_i$ = 1 to $N_L$ do
    $D_i$ = Extract from $D$ all strings of length < $N_i$
    TRAIN(Model($\Theta$), $\lambda$, $D_i$)    ▷ Single pass through $D_i$
// Random Curriculum Phase
while Model($\Theta$) not converged on $V$ or $e_n$ < 200 do
    TRAIN(Model($\Theta$), $\lambda$, $D_i$), $e_n = e_n + 1$
// Stage #2
// Sequential Curriculum Update Phase
$D_i = \emptyset$
for $N_i$ = 1 to $N_{max}$ do
    $D_i$ = Extract from $D$ all strings of length < $N_i$
    TRAIN(Model($\Theta$), $\lambda$, $D_i$)    ▷ Single pass through $D_i$
// Random Curriculum Phase
while Model($\Theta$) not converged on $V$ or $e_n$ < 350 do
    TRAIN(Model($\Theta$), $\lambda$, $D_i$), $e_n = e_n + 1$
return $\Theta$    ▷ Return final trained model weights


that allows for an automatic sorting of samples by increasing complexity, incremental learning is highly effective in 
training higher-order RNNs. In the case of CFGs, we can sort samples based on string length and progressively build a 
model that can learn to generalize to increasingly longer string sequences. Algorithm 1 depicts the full process (note
that we set $N_L = 14$ in this paper and $e_n$ is a variable that marks the number of epochs so far).
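A length-sorted curriculum of this kind might be organized as in the following sketch. The trainer and convergence check are passed in as callables (`train_fn`, `converged_fn`), which are placeholders rather than part of the described method. Two further assumptions are made here: strings of length exactly $n$ are included at curriculum step $n$, and the random phase stops as soon as either convergence or the epoch cap is reached, with the epoch counter restarted in each stage.

```python
import random


def run_stage(data, max_len, train_fn, converged_fn, max_epochs, rng):
    """Sequential curriculum up to max_len, then shuffled passes until convergence."""
    subset = []
    for n in range(1, max_len + 1):
        # ASSUMPTION: strings of length exactly n are included at step n.
        subset = [s for s in data if len(s) <= n]
        if subset:
            train_fn(subset)  # single pass through the current subset
    epochs = 0
    # ASSUMPTION: stop on convergence or on hitting the epoch cap, whichever comes first.
    while not converged_fn() and epochs < max_epochs:
        shuffled = list(subset)
        rng.shuffle(shuffled)
        train_fn(shuffled)
        epochs += 1
    return epochs


def two_stage_incremental(data, train_fn, converged_fn, n_l=14, seed=0):
    """Stage 1 covers lengths up to n_l; stage 2 covers lengths up to the longest string."""
    rng = random.Random(seed)
    n_max = max(len(s) for s in data)
    e1 = run_stage(data, n_l, train_fn, converged_fn, 200, rng)
    e2 = run_stage(data, n_max, train_fn, converged_fn, 350, rng)
    return e1, e2


# Toy usage with a stand-in trainer that only records what it was shown.
calls = []
epochs = two_stage_incremental(["ab", "aabb", "a" * 20], calls.append, lambda: True)
```

In the toy run the longest string only enters the curriculum during the second stage, which is the point of separating the midpoint threshold from the maximum length.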


Regularizing Higher Order RNNs: 


When training any RNN for long periods of time, the model tends to memorize the input training data which damages 
its ability to generalize to unseen sequence data, i.e., overfitting. Higher order RNNs are also susceptible to overfitting 
given their high-capacity and complexity, and yet, no regularization has ever been proposed to help these kinds of RNNs 
to combat overfitting. In this work we extend an adaptive (layer-dependent) noise scheme that was originally proposed 
for training neurobiologically-plausible ANNs [32], which showed strong positive results for simple feedforward 
classification tasks, to RNNs. Notably, our noise-based regularizer applies to higher-dimensional tensors, which are 
fundamental to implementing any n-th order RNN. We are also motivated by the fact that injecting noise
can encourage exploration of an RNN’s error optimization landscape in one of two ways: 1) at the input, i.e., data
augmentation [33], or 2) at the recurrence [34]. Our regularizer falls under the second case.²


The key details of our noise-based regularizer are depicted in Algorithm 2. Based on preliminary experiments, we
found that a noise level between 8% and 30% helps the network to converge faster and, more importantly,
generalize better on unseen sequences longer than those found in the training set. Experimentally, later we will see
that this regularizer improves generalization even when prior knowledge is not integrated into the RNN. 


²We implemented a data augmentation approach but found it yielded poor results when learning context-free grammars.


Algorithm 2 Adaptive Noise Regularizer

Input: Tensor $W \in \mathbb{R}^{A \times B \times C \times D}$, e.g., $W^s$ or $W^a$
// $N_p$ = percentage of noise, “$\cdots$” means $k = k + 1$
function CREATEPARTITIONS($W$, $K$, $N_p$)    ▷ Partition sub-routine for noise regularization function
    // len($W$) = $A \times B$ (calculate length by multiplying 1st two tensor dimensions)
    // $s(P_i, N_p)$ randomly selects $N_p$ matrices in $P_i$
    // Divide $W$ into 3 partitions $\{P_1, P_2, P_3\}$
    // $P_i = \{M_1, M_2, \cdots, M_{len(W)}\}$, $M_i \in \mathbb{R}^{C \times D}$
    if len($W$) is odd
        $P_1 = W[k = 1, \cdots, K/3]$
        $P_2 = W[k = K/3 + 1, \cdots, 2K/3]$
        $P_3 = W[k = (2K/3) + 1, \cdots, K]$
    else len($W$) is even
        $P_1 = W[k = 1, \cdots, (K - 1)/3]$
        $P_2 = W[k = (K - 1)/3 + 1, \cdots, (2K - 1)/3]$
        $P_3 = W[k = (2K - 1)/3 + 1, \cdots, K]$
    // Create set $Q$ of $N_p$ random matrices from each $P_i$
    $Q = \{s(P_1, N_p), s(P_2, N_p), s(P_3, N_p)\}$
    return $Q$

function ADAPTIVENOISE($Q$)
    $\rho \sim \mathcal{N}(\mu = 0, \sigma = 1)$    ▷ Draw Gaussian scalar sample
    for each $M$ in $Q$ do    ▷ for each matrix in $Q$
        $M = (\rho \otimes) M$, $\rho = \rho / 2$
    // Remap matrices $M$ in $Q$ to tensor shaped like $W$
    $W \leftarrow$ remap($Q$)
    return $W$
// Use updated weight matrix for gradient computation
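One possible reading of Algorithm 2 is sketched below in pure Python. Several details are assumptions of this sketch rather than facts from the listing: the three partitions are formed by simple contiguous integer cuts instead of the odd/even case split, the noise level is given as a fraction (`noise_fraction`) with at least one matrix picked per partition, and the Gaussian scalar is applied by element-wise multiplication. The function name `adaptive_noise` is likewise hypothetical.

```python
import random


def adaptive_noise(W, noise_fraction=0.2, seed=0):
    """Return a noisy copy of a 4D nested-list tensor W (shape A x B x C x D)."""
    rng = random.Random(seed)
    A, B = len(W), len(W[0])
    # View W as a list of A*B matrices (each C x D), addressed by (a, b).
    index = [(a, b) for a in range(A) for b in range(B)]
    K = len(index)
    # ASSUMPTION: three contiguous, near-equal partitions stand in for the odd/even split.
    cuts = [0, K // 3, (2 * K) // 3, K]
    selected = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        part = index[lo:hi]
        n_pick = min(len(part), max(1, round(noise_fraction * len(part)))) if part else 0
        selected.extend(rng.sample(part, n_pick))
    # Copy the tensor so the caller's weights are left untouched.
    noisy = [[[row[:] for row in W[a][b]] for b in range(B)] for a in range(A)]
    rho = rng.gauss(0.0, 1.0)  # single Gaussian scalar draw
    for a, b in selected:
        # ASSUMPTION: the noise scalar multiplies the matrix element-wise.
        noisy[a][b] = [[rho * v for v in row] for row in noisy[a][b]]
        rho /= 2.0  # halve the noise scale after each matrix
    return noisy, selected


W = [[[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)] for _ in range(3)]
W_noisy, touched = adaptive_noise(W, noise_fraction=1 / 3)
```

Only the selected matrices are perturbed, and each successive one sees half the noise scale of the previous one, which is the "adaptive" part of the scheme as read here.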


Integrating Prior Knowledge 


Programming and Inserting Rules 


We start by defining the data generating process that any RNN is to learn from, i.e., a PDA that generates a set of
positive and negative strings. Formally, the $M$-state PDA is defined as a 7-tuple $(Q, \Sigma, \Gamma, \delta, q_0, \perp, F)$ where:

• $\Sigma = \{a^1, \cdots, a^l, \cdots, a^L\}$ is the input alphabet

• $Q = \{s^1, \cdots, s^m, \cdots, s^M\}$ is the finite set of states

• $\Gamma$ is known as the stack alphabet (a finite set of tokens)

• $q_0$ is the start state

• $\perp$ is the initial stack symbol

• $F \subseteq Q$ is the set of accepting states

• $\delta \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma \to Q \times \Gamma^*$ is the state transition.
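As a concrete instance of the 7-tuple above, the following sketch simulates a small deterministic PDA for the language $a^n b^n$ ($n \geq 1$). The state names, stack symbols, and the acceptance rule (ending in an accepting state with only the initial stack symbol left, used here to avoid $\epsilon$-moves) are choices made for this illustration only.

```python
def run_pda(transitions, start, initial_symbol, accepting, string):
    """Simulate a deterministic PDA without epsilon-moves on one input string."""
    state, stack = start, [initial_symbol]
    for symbol in string:
        if not stack:
            return False
        key = (state, symbol, stack[-1])
        if key not in transitions:
            return False  # no rule applies: reject
        state, replacement = transitions[key]
        stack.pop()
        # The replacement is written top-first, so push it in reverse order.
        stack.extend(reversed(replacement))
    # ASSUMPTION: accept only in an accepting state with just the initial symbol left.
    return state in accepting and stack == [initial_symbol]


# delta for a^n b^n: push one "A" per "a", pop one "A" per "b".
DELTA = {
    ("q0", "a", "Z"): ("q0", ("A", "Z")),
    ("q0", "a", "A"): ("q0", ("A", "A")),
    ("q0", "b", "A"): ("q1", ()),
    ("q1", "b", "A"): ("q1", ()),
}


def accepts(string):
    return run_pda(DELTA, "q0", "Z", {"q1"}, string)
```

Strings such as "aabb" are accepted, while "aab", "abb", and "ba" are rejected; a PDA of this kind can serve as the generator of positive and negative samples.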


To insert rules related to known state transitions into the (V-state) NSPDA, one needs to program its recurrent weights 
(which could be second or third order). Since the number of states in the PDA is not known beforehand, we assume that
J > M and that the network has enough capacity to learn an unknown context-free grammar. 


In order to program and insert rules, we propose adapting methodology originally developed for second-order RNNs
and deterministic finite state automata (DFA) to the case of PDA-based RNNs. Specifically, we will exploit the
similarity between the state transitions of the target PDA and the underlying dynamics of a stack-driven RNN. Consider
a known transition $\delta(s^j, a^l, T_s) = (s^i, \gamma)$, where $T_s$ is the top of the stack and $\gamma$ is the sequence of symbols replacing
$T_s$. We then identify PDA states $s^j$ and $s^i$, which correspond to state neurons $z_j$ and $z_i$, respectively. Recall that each
symbol has specific stack operations associated with it, which provide prior knowledge as to when to push and when to
pop from the stack. It is desirable that the state neuron $z_i$ has a high output close to 1 and $z_j$ has a low output close to 0
after reading an input symbol $a^l$ using input neuron $x_l$ and the top of the stack $T_s$ using read neuron $r_k$ (remember
that a read depends on an action neuron, as depicted in model Equation 3). This condition can be achieved by doing the
following: 1) set the (third order) weights $W^s_{ijkl}$ to a large positive value, which helps to ensure that the state neuron
$z_i$ at the next time step $t + 1$ will be high (and since $\hat{g}(v)$ is sigmoidal, this tends towards 1), and 2) set $W^s_{jjkl}$ to a
large negative value, which would make the output of the state neuron $z_j$ low (tending $\hat{g}(v)$ towards 0).
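The two weight assignments described above can be sketched as follows for a nested-list tensor indexed as `W_s[i][j][k][l]`. The helper names, the strength parameter `H`, and the decision to skip the negative assignment for self-loops (where the source and target neuron coincide) are assumptions of this example.

```python
def zeros4(n_i, n_j, n_k, n_l):
    """Hypothetical helper: nested-list tensor indexed as W[i][j][k][l]."""
    return [[[[0.0] * n_l for _ in range(n_k)] for _ in range(n_j)] for _ in range(n_i)]


def program_transition(W_s, i, j, k, l, H=1.0):
    """Encode a known move from state j to state i on input l with stack top k."""
    W_s[i][j][k][l] = +H       # drive the target state neuron z_i high
    if i != j:
        # ASSUMPTION: for a self-loop the neuron should stay high, so it is not suppressed.
        W_s[j][j][k][l] = -H   # drive the source state neuron z_j low
    return W_s


# Toy usage: 3 state neurons, 2 read neurons, 2 input neurons.
W_s = program_transition(zeros4(3, 3, 2, 2), i=2, j=0, k=1, l=0, H=4.0)
```

Only the entries tied to the known transition are touched; all other weights keep their initial values and remain free to adapt during training.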


The next item to consider is the set of (ternary) action weights stored in $W^a_{ijkl}$, which drive the action neurons that yield the
stack operations (recall that [-1,0,1] maps to [pop,no-op,push]). First, we must assume that the total contribution of the
weighted output of all state neurons can be neglected; this can be achieved by setting all other state neurons to the
lowest value. In addition, we assume that each state neuron can only be assigned to one known state of the PDA. If we
have prior knowledge of accepting and non-accepting states related to a particular neuron, we may then bias its output
$z_i^{t+1}$. We start from $i = 1$ (the leftmost neuron in the vector $z_t$) and work towards $i = J$, programming each one by one.


Armed with these assumptions, we can then stably encode rules into the NSPDA by programming the weight $W^a_{ijkl}$ to
be a large positive value if the PDA’s state $s^i$ is an accepting state. Otherwise, we set $W^a_{ijkl}$ to be a large negative value if
the state is non-accepting. If no such knowledge of the PDA is available, $W^a_{ijkl}$ remains unchanged.
Though described for a third order NSPDA, the above approach for programming weights also applies to a second
order model. In a lower order NSPDA, with 3D weight tensors $W^s_{ijk}$ and $W^a_{ijk}$, state updates and transitions
are conducted by concatenating a read neuron $r_t$ with an input neuron $x_t$ to create a single vector. However, when
programming a second order model, we are now working with a DFA instead of a PDA, which limits the capabilities
of the NSPDA (as well as restricts its capacity) since we do not possess any knowledge about what to push or pop.
Nevertheless, when combined with our proposed learning procedure that incorporates iterative refinement, we believe that
the second order NSPDA can still learn what action to perform. An issue of dimensionality does arise: the state
space of a lower order model is very large when compared to that of a third order NSPDA. In the case of a PDA-based
model, pushing multiple symbols might lead to reaching the same accepting state; in the case of a DFA-based model
(the second order NSPDA), however, we create separate sets of accepting states for each symbol. We found that this splitting
mechanism was crucial in getting our network to work perfectly with a digital stack.


While the above rule insertion scheme seems simple enough, determining the actual values for the weights that are to 
be programmed can be quite problematic. In the case of third order synaptic connections (with binary weights), with 
just 4 neurons, there are $2^{256}$ different combinations, which would quickly render our method impractical.
However, we can sidestep this computational infeasibility by making use of “hints” [19] within the framework
of “orthogonal state encoding”. By assuming that the PDA starts generating a valid grammar at its initial state, we can 
then randomly choose a single state and make the output of one state neuron equal to 1. The outputs of all the other 
neurons are set to be equal to 0. Following this, we set the values of weights (according to known state transitions) 
according to the approach described above. Notably, these weights, though initially programmed, are still adaptable, 
making them amenable to tuning to a target grammar underlying a data sample. Programming the weights of second or 
third order networks jointly impacts the behavior of the state neurons $z_t$, the read neurons $r_t$, and the input neurons $x_t$.
Following the scheme we described above yields sparse NSPDA representations of PDA states. 
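The size of the search space and the orthogonal (one-hot) starting state can be illustrated with a short sketch. It assumes that all four dimensions of the third order weight tensor have size 4, which gives $4^4 = 256$ binary entries; the function names and the seeded random choice of the active neuron are hypothetical.

```python
import random


def num_binary_configurations(n_neurons, order=3):
    """Distinct binary settings of a weight tensor with n_neurons ** (order + 1) entries."""
    return 2 ** (n_neurons ** (order + 1))


def orthogonal_initial_state(n_state_neurons, seed=0):
    """One-hot state vector: one randomly chosen neuron outputs 1, all others 0."""
    rng = random.Random(seed)
    state = [0] * n_state_neurons
    state[rng.randrange(n_state_neurons)] = 1
    return state


count = num_binary_configurations(4)   # 4**4 = 256 binary weights
z0 = orthogonal_initial_state(6)
```

Starting from a one-hot state means only the weights attached to the active neuron matter for the first transition, so rules can be written one known transition at a time instead of searching the full configuration space.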


It is difficult to program an NSPDA with a minimal number of states, despite the fact that we have a theoretical guarantee 
that the third order model is equivalent to PDA dynamics [23]. 


As we will observe in our results, the proposed methodology significantly reduces the NSPDA’s convergence time during
optimization (leading to roughly comparable training time characteristic of first order RNNs), which is particularly 
important given the fact that its inference process entails 4D tensor products (which are far more expensive than the 
matrix computations of modern-day RNNs). 


Experimental Details 


We focused on five context-free grammars, some labeled as Dyck(2) languages, which are some of the more difficult 
CFGs to recognize. For each grammatical inference task, we create a dataset that contains 1987 positive and 2021 
negative (string) samples. Each sequence was of length T which was sampled via T ~ U(1, 21), where U(a, b) is the 
uniform distribution defined over the interval [a, b]. From the samples generated, we randomly sampled a subset from 
the total number of tokens generated. 


The number of state neurons for a second order NSPDA is set according to the following formula: $J = M + n$, where $n \sim U(12, 29)$.
For a third order NSPDA, the number of state neurons was set according to: $J = M + n$, where $n \sim U(2, 6)$.


All models made use of the iterative refinement loss (Equation 9 with K = 4). Weight updates were computed using
whichever algorithm, i.e., BPTT, truncated BPTT (TBPTT) (50 steps back in time), RTRL, or UORO, yielded the best
performance for a given model. For higher order networks, UORO performed better and we use this to optimize all

                            |  Palindrome |  $a^n b^n$  | $a^n b^m c b^m a^n$ | $a^{n+m} b^n c^m$
Rule Method                 |  W1  |  W2  |  W1  |  W2  |   W1    |    W2     |   W1   |   W2
NNPDA w/o hints             | 100  |  81  | 280  | 215  |   NA    |    NA     |   NA   |   NA
NNPDA w/ dead neuron hints  |  92  |  83  | 212  | 192  |   485   |    145    |   250  |   195
NSPDA w/o hints             |  91  |  79  | 221  | 192  |   488   |    159    |   339  |   293
NSPDA w/ Hint #1            |  80  |  75  | 190  | 170  |   410   |    140    |   240  |   160
NSPDA w/ Hint #2            |  70  |  72  | 150  | 138  |   389   |    134    |   222  |   148

Table 1: Comparison between NSPDAs trained w/ and w/o hints using either 2nd order weights (W1) or 3rd order
weights (W2).


               |  Palindrome   |       $a^n b^n$       |  $a^n b^m c b^m a^n$  |   $a^{n+m} b^n c^m$
Train Method   |  M1   |  M2   |    M1     |    M2     |    M1     |    M2     |    M1     |    M2
Standard       | 5699  | 5912  | > 200000  | > 210000  | > 240000  | > 233000  | > 320000  | > 315000
IL             | 2678  | 2552  |  108200   |  104556   |  192001   |  192551   |  222171   |  222144
2-IL (ours)    | 2001  | 2199  |   9899    |   10001   |  130192   |  129998   |  177189   |  177190

Table 2: Incremental learning NSPDA (without hints) performance results. Each value is a measurement of the average
number of characters required to reach convergence (M1 = 2nd order NSPDA, M2 = 3rd order NSPDA).


                       |  Palindrome   |   $a^n b^n$   | $a^n b^m c b^m a^n$ | $a^{n+m} b^n c^m$
Regularization Method  |  M1   |  M2   |  M1   |  M2   |   M1    |    M2     |   M1   |   M2
w/o reg                | 4.55  | 2.99  | 1.28  | 1.55  |  5.51   |   4.19    |  2.18  |  2.00
w/ reg                 | 0.00  | 0.00  | 0.06  | 0.01  |  0.99   |   0.00    |  0.09  |  0.00

Table 3: Mean classification error for an NSPDA w/ & w/o adaptive noise (tested on string length up to T = 60).

                            |  Palindrome   |   $a^n b^n$   | $a^n b^m c b^m a^n$ | $a^{n+m} b^n c^m$ |  Parenthesis
RNN Type                    | Train | Test  | Train | Test  | Train |    Test     | Train |   Test    | Train | Test
RNN                         | 0.00  | 78.2  | 0.00  | 74.11 | 0.00  |    83.33    | 0.00  |   73.69   | 30.72 | 99.96
LSTM                        | 0.00  | 12.58 | 0.00  | 13.26 | 0.00  |    14.22    | 0.00  |   10.56   | 48.92 | 97.88
LSTM-p                      | 0.00  | 8.69  | 0.00  | 11.25 | 0.00  |    13.99    | 0.00  |   12.88   | 49.68 | 99.00
GRU                         | 0.00  | 14.99 | 0.00  | 14.89 | 0.00  |    19.22    | 0.00  |   14.00   | 43.21 | 98.70
Stack RNN 40+10             | 0.00  | 4.99  | 0.00  | 3.01  | 0.00  |    34.19    | 0.00  |   58.66   | 10.25 | 9.38
Stack RNN 40+10+rounding    | 0.00  | 0.09  | 0.00  | 0.89  | 0.00  |    1.01     | 0.00  |   0.79    | 4.03  | 3.968
listRNN 40+5                | 0.00  | 0.39  | 0.00  | 2.29  | 0.00  |    19.63    | 0.00  |   1.27    | 4.89  | 7.45
2nd Order RNN               | 0.00  | 9.26  | 0.00  | 8.51  | 0.00  |    17.52    | 0.00  |   11.17   | 27.89 | 37.42
2nd Order RNN reg (ours)    | 0.00  | 1.88  | 0.00  | 2.09  | 0.00  |    2.19     | 0.00  |   0.99    | 21.69 | 27.59
NNPDA                       | 0.00  | 7.00  | 0.00  | 15.25 | 0.00  |    17.49    | 0.00  |   55.28   | 5.96  | 29.21
NNPDA reg (ours)            | 0.00  | 4.28  | 0.00  | 14.20 | 0.00  |    13.00    | 0.00  |   41.01   | 5.62  | 27.09
NSPDA, M1 (ours)            | 0.00  | 0.00  | 0.00  | 0.06  | 0.00  |    0.99     | 0.00  |   0.09    | 0.58  | 2.58
NSPDA, M2 (ours)            | 0.00  | 0.00  | 0.00  | 0.01  | 0.00  |    0.00     | 0.00  |   0.00    | 0.01  | 0.88

Table 4: Mean classification error for various recurrent architectures when tested on strings of length up to T = 60.


RNNs of this type in this study³ (in the appendix, we offer a comparison of the various weight update rules when
training an NSPDA). Gradients were hard clipped to 13. Parameters were updated using stochastic gradient descent
(SGD), which made use of a previously proposed stochastic learning rate annealing scheme with an initial learning rate of
0.1005000321. All models were trained for a maximum of 500 epochs (or until convergence was reached, which was
marked as 100% training accuracy). Experiments for each and every model were repeated 5 times.


All of our models used our proposed rule encoding scheme and all of the RNNs were trained using our proposed
two-stage incremental learning procedure. In Table 2, to demonstrate the value of our proposed two stage incremental
training procedure (2-IL), we compare an NSPDA trained without any incremental learning, one trained with ours, and one
trained with a previously proposed incremental learning approach (IL), and find that our approach yields the best results
across all grammars. All higher-order RNNs made use of our proposed adaptive noise regularizer, though in Table 3
we examine how the NSPDA performs with and without the proposed regularizer. With respect to the hints used,
for all tables presented in the main paper, whenever hint usage is indicated, we mean Hint #2 (which worked the best
empirically). In the appendix, we provide a detailed breakdown and ablation for all of the models investigated in this
paper. Specifically, we present results for models that were trained with and without our regularizer as well as under
various hint insertion conditions (no hints, Hint #1, and Hint #2).


3 For all first order RNNs, we found that BPTT worked best and used it to train all RNNs of this type in our experiments. 


Baseline Algorithms: In order to provide the proper context to demonstrate the effectiveness of our proposed NSPDA, 
we conduct a thorough comparison of our model to as many baseline RNN models as possible. These models include 
a range of first order RNNs: variations of the stack-RNN (depth k = 2, all other metaparameters set 
according to the original source), including the two variant models as well as the linked-list model (using the same model 
labels as the original paper), the Long Short Term Memory RNN without (LSTM) and with peepholes (LSTM-p), 
the Gated Recurrent Unit (GRU) RNN [25], and a simple Elman RNN. We also compared to gated first order RNNs 
with multiplicative units but, due to space constraints, we report these results in the appendix. We furthermore compare 
against second order RNNs without (2nd Order RNN) and with regularization (2nd Order RNN reg), as well as the 
classical NNPDA without (NNPDA) and with regularization (NNPDA reg). All baseline RNNs had a single layer of < 50 neurons 
and the individual hyperparameters of each were optimized based on validation set performance. 


Results and Discussion 


To the best of our knowledge, we are the first to conduct a comparison across such a wide variety of RNN models of 
first, second, and third order, with and without external (stack-based) memory. For simple algorithmic patterns 
(non-Dyck(2) CFGs), first order RNNs like the LSTM and GRU perform reasonably well, primarily because they 
utilize dynamic counting [3] [7], yet they do not learn any state transitions. This is evidenced by their 
performance on the complex Dyck(2) CFG, where the majority of RNNs exhibit great difficulty in generalizing to 
longer sequences. These results corroborate those of prior work, specifically those that demonstrate that the LSTM 
essentially performs a form of dynamic counting, making it ill-suited to recognizing complex grammars [36]. 


As pointed out in prior work, there is a strong need for neural architectures with external memory, i.e., a stack, to solve 
complex CFGs; in this study, we further argue that prior knowledge is needed as well. This makes sense 
given that it is known that prior information often leads to greatly improved reasoning and better generalization [21]. 
The stack and list RNNs do make use of (continuous) external memory (in fact, multiple stacks/lists) but, theoretically, 
a single stack should be sufficient for a PDA to recognize strings of arbitrary length, while a 2-stack PDA is as powerful as a 
Turing machine [37]. Quite surprisingly, however, a stack-RNN with even 10 stacks has difficulty in generalizing 
to a complex grammar. This lines up with the theory: it has been proven that adding more than 2 stacks to a PDA 
does not provide any further computational advantage. 
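
To make the single-stack argument concrete, the following is a minimal sketch of a classical (non-neural) recognizer for a Dyck(2) language that uses one discrete stack. It is not the NSPDA itself; the bracket alphabet and the function name `accepts_dyck2` are assumptions made purely for illustration.

```python
# Each closing symbol maps to the opening symbol it must match.
PAIRS = {")": "(", "]": "["}


def accepts_dyck2(string):
    """Return True if `string` is a well-nested sequence over ( ) [ ]."""
    stack = []
    for symbol in string:
        if symbol in PAIRS.values():
            stack.append(symbol)          # push on an opening bracket
        elif symbol in PAIRS:
            # pop on a closing bracket; reject on mismatch or empty stack
            if not stack or stack.pop() != PAIRS[symbol]:
                return False
        else:
            return False                  # symbol outside the alphabet
    return not stack                      # accept only if the stack is empty
```

Because the stack is unbounded, the same procedure handles strings of any length, which is exactly the property that length-limited training must extrapolate to.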


Finally, it is impressive to see that high order RNNs coupled with external memory, particularly with a discrete stack 
structure (as opposed to a continuous stack like that of the stack-RNN), perform so well across all CFGs. It is important 
to note that even the way our state-based RNN operates is markedly different from that of past models: the 
NSPDA works as a next-step prediction model, which allows us to use the powerful iterative refinement procedure 
as a way to aggressively error correct its states when predicting string validity (at least during training time). Table 
4 shows that our NSPDA model generalizes very well when trained on sequences of length T < 21 but tested on 
sequences of length up to T = 60. Finally, our results demonstrate the value of rule insertion, which, as we see 
empirically, in some cases improved convergence speed by a wide margin. 


Conclusions 


In this work, we proposed the neural state pushdown automaton (NSPDA) and its learning process, which utilizes an 
iterative refinement-based loss function, a two-stage incremental training procedure, an adaptive noise regularization 
scheme (which works with any higher order network), and a method for stably encoding rules into the model itself. 


Our experimental results, which focused on context-free grammars (CFGs), demonstrate that prior knowledge is essential 
to learning memory-augmented models that recognize complex CFGs well. Notably, we have empirically demonstrated the 
expressivity and flexibility of a high order temporal neural model that learns how to manipulate an external discrete stack. 
While our proposed neural model works with a discrete stack, our model’s underlying framework could be extended to 
manipulate other kinds of data structures, a subject of future work. When training on various CFGs, the state-based 
neural models we optimize converge faster and are more expressive than even powerful classical models such as 
the neural network pushdown automaton. Furthermore, we have shown that modern-day, popular recurrent network 
structures (all of which are first order) struggle greatly to recognize complex grammars. These discovered limitations of 
first order RNNs indicate that ANN research should consider the exploration of more expressive, memory-augmented 
models that offer ways to better integrate prior knowledge. 
Appendix 


Additional Results 


In Table 7, we report an expansion of the model performance table that appears in the main paper. In it, we report the 
performance of 3 modern gated RNNs with multiplicative gating units, i.e., MI-RNN, MI-LSTM, MI-GRU. Interestingly 
enough, one could consider the multiplicative units to be a crude approximation of second order state neurons. 
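
One way to see the connection between multiplicative units and second order neurons is sketched below. We assume, for illustration only, that the core of a multiplicative unit is the element-wise product of the recurrent and input projections, and that a second order neuron computes a bilinear form over the state and input through a weight tensor; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_input = 4, 3
U = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights
W = rng.normal(size=(n_hidden, n_input))   # input-to-hidden weights
h = rng.normal(size=n_hidden)              # previous state
x = rng.normal(size=n_input)               # current input

# Multiplicative term: element-wise product of the two projections.
mi_term = (U @ h) * (W @ x)

# The same quantity written as a second order form whose weight tensor
# is constrained to the factorized shape T[i, j, k] = U[i, j] * W[i, k].
T = np.einsum("ij,ik->ijk", U, W)
second_order_term = np.einsum("ijk,j,k->i", T, h, x)

# A general second order neuron uses an unconstrained tensor instead.
T_full = rng.normal(size=(n_hidden, n_hidden, n_input))
general_term = np.einsum("ijk,j,k->i", T_full, h, x)
```

In this toy setting the factorized tensor is determined by 28 weights, whereas the unconstrained tensor has 48 free entries, which is the sense in which the multiplicative unit is only a restricted approximation.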


Table 5 shows results for stably programming the weights of the NSPDA which, in effect, demonstrates that a 
programmed NSPDA (without learning) is equivalent to the PDA of a complex grammar. 


In Table 6, we highlight how various learning algorithms affect the generalization ability of higher 
order recurrent networks. Here, we compare back-propagation through time (BPTT) to other online learning algorithms 
such as real time recurrent learning (RTRL) and unbiased online recurrent optimization (UORO). We describe these 
procedures in further detail in the next section. 


Notably, in our experiments, we observed that UORO boosts performance for higher order recurrent networks, while 
being faster than RTRL, the original algorithm-of-choice when training higher order, state-based models. Furthermore, 
we remark that truncated BPTT (TBPTT) can, for some CFGs, slightly improve model performance over BPTT 
(but for others, such as the palindrome CFG, it leads to worse generalization). 


On Training Algorithms 


For all of the RNNs we study, we compared their (validation) performance when using various online and offline 
learning algorithms. As mentioned in the last section, we found that UORO worked best for the NSPDA, which is 
advantageous in that UORO is faster than RTRL (it has a much lower computational complexity) and does not require model 
unfolding like the popular and standard BPTT/TBPTT algorithms do. These results, again, are summarized in Table 6. 


Below we briefly describe the non-standard approaches to learning RNNs, specifically RTRL and UORO. Notably, we 
are the first to implement and adapt UORO in calculating the updates to the weights of higher order networks. 


Real-Time Recurrent Learning 


Real-time recurrent learning (RTRL) is a classical online learning procedure for training RNNs [27]. The aim is to 
optimize the parameters $\Theta$ of a state-based model in order to minimize a total (sequence) loss. The state model is 
abstracted as the following function: 


$$z_{t+1} = F_{state}(x_{t+1}, z_t, \Theta). \tag{10}$$


RTRL computes the derivative of the model’s states and outputs with respect to the synaptic weights during the model’s 
forward computation, as data points in the sequence are processed iteratively, i.e., without any unfolding as in BPTT. 
When the task is next step prediction (predict $x_t$ given a history $x_{<t}$), the gradient of the loss $L_{t+1}$ that 
RTRL uses for optimization is given by the following, where $F_{out}$ denotes the model's output function: 


$$\frac{\partial L_{t+1}}{\partial \Theta} = \frac{\partial L_{t+1}(y_{t+1}, y^*_{t+1})}{\partial y} \left( \frac{\partial F_{out}(x_{t+1}, z_t, \Theta)}{\partial z} \frac{\partial z_t}{\partial \Theta} + \frac{\partial F_{out}(x_{t+1}, z_t, \Theta)}{\partial \Theta} \right). \tag{11}$$


Once we differentiate Equation 10 with respect to $\Theta$, we obtain: 


$$\frac{\partial z_{t+1}}{\partial \Theta} = \frac{\partial F_{state}(x_{t+1}, z_t, \Theta)}{\partial \Theta} + \frac{\partial F_{state}(x_{t+1}, z_t, \Theta)}{\partial z} \frac{\partial z_t}{\partial \Theta}, \tag{12}$$


where, at each time step, we compute $\frac{\partial z_{t+1}}{\partial \Theta}$ based on $\frac{\partial z_t}{\partial \Theta}$. These values are then used to directly compute $\frac{\partial L_{t+1}}{\partial \Theta}$. 


The above is, in short, how RTRL calculates its gradients without resorting to backward transfer or computation graph 
unfolding (as in reverse-mode differentiation). Since $\frac{\partial z_t}{\partial \Theta}$ has shape $|z| \times |\Theta|$, for standard RNNs with n 
hidden units, this calculation scales as $n^4$ in time [38]. This high complexity makes RTRL highly impractical 
for training very wide and very deep recurrent models. However, in the case of a third order model like the NSPDA (or an 
NNPDA), the number of states needed for learning a target grammar is generally far smaller than that required by second 
or first order models (as we mentioned in the main paper). This means that a procedure such as RTRL is still applicable 
and useful, at least for training RNNs to recognize context free grammars (of low input dimensionality). 
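
As an illustration of the forward recursion for the state sensitivities, the sketch below applies it to a small first order tanh RNN rather than to the NSPDA. The model form, the parameter layout (the rows of the recurrent matrix followed by the rows of the input matrix), and the names `rtrl_sensitivity` and `final_state` are assumptions made for this example only.

```python
import numpy as np


def f_state(x, z, W, U):
    """Assumed state function: a first order tanh RNN."""
    return np.tanh(W @ z + U @ x)


def rtrl_sensitivity(xs, W, U):
    """Carry dz_t/dTheta forward in time, with Theta = [W.ravel(), U.ravel()]."""
    n, m = U.shape
    z = np.zeros(n)
    P = np.zeros((n, n * n + n * m))      # dz_0/dTheta = 0
    for x in xs:
        z_next = f_state(x, z, W, U)
        d = 1.0 - z_next ** 2             # derivative of tanh
        dF_dz = d[:, None] * W            # dF_state/dz, shape (n, n)
        # Immediate dF_state/dTheta: row i only depends on row i of W and U.
        direct = np.hstack([np.kron(np.eye(n), z[None, :]),
                            np.kron(np.eye(n), x[None, :])])
        dF_dtheta = d[:, None] * direct
        P = dF_dtheta + dF_dz @ P         # forward sensitivity recursion
        z = z_next
    return z, P


def final_state(theta, xs, n, m):
    """Run the same RNN from a flat parameter vector (for gradient checks)."""
    W = theta[:n * n].reshape(n, n)
    U = theta[n * n:].reshape(n, m)
    z = np.zeros(n)
    for x in xs:
        z = f_state(x, z, W, U)
    return z


rng = np.random.default_rng(0)
n, m, T = 3, 2, 5
W = 0.5 * rng.normal(size=(n, n))
U = 0.5 * rng.normal(size=(n, m))
xs = rng.normal(size=(T, m))
z_T, P = rtrl_sensitivity(xs, W, U)
```

The sensitivity matrix `P` has one row per state unit and one column per parameter, which is why its storage and update cost grow so quickly with the number of hidden units. Multiplying `P` on the left by the derivative of a loss with respect to the state would give the parameter gradient without any backward pass.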
Grammar         | Palindrome            || $a^n b^n$             || $a^n b^m c b^m a^n$   || $a^{n+m} b^n c^m$ 
Model           | n=60 | n=480 | n=960 || n=60 | n=480 | n=960 || n=60 | n=480 | n=960 || n=60 | n=480 | n=960 
2nd Order NSPDA | 0.0  | 0.0   | 0.0   || 0.0  | 0.0   | 0.0   || 0.0  | 0.0   | 0.0   || 0.0  | 0.0   | 0.0 
3rd Order NSPDA | 0.0  | 0.0   | 0.0   || 0.0  | 0.0   | 0.0   || 0.0  | 0.0   | 0.0   || 0.0  | 0.0   | 0.0 


Table 5: Mean classification error results when using a programmed NSPDA (lower is better). 


Grammar            | Palindrome  || $a^n b^n$   || $a^n b^m c b^m a^n$ || $a^{n+m} b^n c^m$ 
Learning Algorithm | M1   | M2   || M1   | M2   || M1   | M2           || M1   | M2 
BPTT               | 2.02 | 1.99 || 2.23 | 2.55 || 2.99 | 2.79         || 2.99 | 1.59 
TBPTT              | 2.59 | 2.81 || 1.05 | 1.29 || 2.97 | 2.02         || 1.58 | 1.11 
RTRL               | 0.02 | 0.19 || 0.09 | 0.10 || 1.85 | 0.07         || 0.09 | 0.01 
UORO               | 0.00 | 0.00 || 0.06 | 0.01 || 0.99 | 0.00         || 0.09 | 0.00 

Table 6: Mean classification error for the NSPDA trained via various learning algorithms (tested on string length up to 
T = 60). 

Grammar                  | Palindrome    || $a^n b^n$     || $a^n b^m c b^m a^n$ || $a^{n+m} b^n c^m$ || Parenthesis 
RNN Type                 | Train | Test  || Train | Test  || Train | Test        || Train | Test      || Train | Test 
RNN                      | 0.00  | 78.2  || 0.00  | 74.11 || 0.00  | 83.33       || 0.00  | 73.69     || 30.72 | 99.96 
LSTM                     | 0.00  | 12.58 || 0.00  | 13.26 || 0.00  | 14.22       || 0.00  | 10.56     || 48.92 | 97.88 
LSTM-p                   | 0.00  | 8.69  || 0.00  | 11.25 || 0.00  | 13.99       || 0.00  | 12.88     || 49.68 | 99.00 
GRU                      | 0.00  | 14.99 || 0.00  | 14.89 || 0.00  | 19.22       || 0.00  | 14.00     || 43.21 | 98.70 
Stack RNN 40+10          | 0.00  | 4.99  || 0.00  | 3.01  || 0.00  | 34.19       || 0.00  | 58.66     || 10.25 | 9.38 
Stack RNN 40+10+rounding | 0.00  | 0.09  || 0.00  | 0.89  || 0.00  | 1.01        || 0.00  | 0.79      || 4.03  | 3.968 
listRNN 40+5             | 0.00  | 0.39  || 0.00  | 2.29  || 0.00  | 19.63       || 0.00  | 1.27      || 4.89  | 7.45 
MI-RNN                   | 0.00  | 75.69 || 0.00  | 70.26 || 0.00  | 76.69       || 0.00  | 73.01     || 29.58 | 99.92 
MI-LSTM                  | 0.00  | 9.99  || 0.00  | 10.86 || 0.00  | 13.55       || 0.00  | 14.22     || 47.83 | 99.80 
MI-GRU                   | 0.00  | 16.22 || 0.00  | 13.29 || 0.00  | 20.02       || 0.00  | 14.83     || 42.88 | 99.20 
2nd Order RNN            | 0.00  | 9.26  || 0.00  | 8.51  || 0.00  | 17.52       || 0.00  | 11.17     || 27.89 | 37.42 
2nd Order RNN reg (ours) | 0.00  | 1.88  || 0.00  | 2.09  || 0.00  | 2.19        || 0.00  | 0.99      || 21.69 | 27.59 
NNPDA                    | 0.00  | 7.00  || 0.00  | 15.25 || 0.00  | 17.49       || 0.00  | 55.28     || 5.96  | 29.21 
NNPDA reg (ours)         | 0.00  | 4.28  || 0.00  | 14.20 || 0.00  | 13.00       || 0.00  | 41.01     || 5.62  | 27.09 
NSPDA, M1 (ours)         | 0.00  | 0.00  || 0.00  | 0.06  || 0.00  | 0.99        || 0.00  | 0.09      || 0.58  | 2.58 
NSPDA, M2 (ours)         | 0.00  | 0.00  || 0.00  | 0.01  || 0.00  | 0.00        || 0.00  | 0.00      || 0.01  | 0.88 


Table 7: Mean classification error for various recurrent architectures when tested on strings of length up to T = 60. 


Unbiased Online Recurrent Optimization 


Unbiased Online Recurrent Optimization (UORO) [28] uses a rank-one trick to approximate the operations needed to 
make RTRL’s gradient computation work. This trick helps to reduce the overall complexity of the computation at the price of 
increasing the variance of its gradient estimates. 


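When designing an optimizer like UORO, we start from the idea that, for any given unbiased estimation of $\frac{\partial z_t}{\partial \Theta}$, we can 
form a stochastic matrix $\tilde{Z}_t$ such that $\mathbb{E}(\tilde{Z}_t) = \frac{\partial z_t}{\partial \Theta}$. Since Equations 11 and 12 are affine in $\frac{\partial z_t}{\partial \Theta}$, the “unbiasedness” (of 
gradient estimates) is preserved due to the linearity of the expectation. Next, we compute the value of $\tilde{Z}_t$ and plug 
it into Equations 11 and 12 to calculate the values of $\frac{\partial L_{t+1}}{\partial \Theta}$ and $\frac{\partial z_{t+1}}{\partial \Theta}$. In a rank-one, unbiased approximation, at time step t, 
$\tilde{Z}_t = \tilde{z}_t \otimes \tilde{\Theta}_t$. To calculate $\tilde{Z}_{t+1}$ at t + 1, we plug $\tilde{Z}_t$ into Equation 12. Nonetheless, mathematically, the above is still not 
yet a rank-one approximation of RTRL. 


In order to finally obtain a proper rank-one approximation, one must use an additional, efficient approximation technique, 
proposed in [39], to rewrite the above equation as: 


$$\tilde{Z}_{t+1} = \left( \rho_0 \frac{\partial F_{state}(x_{t+1}, z_t, \Theta)}{\partial z} \tilde{z}_t + \rho_1 \nu \right) \otimes \left( \frac{\tilde{\Theta}_t}{\rho_0} + \frac{(\nu)^T}{\rho_1} \frac{\partial F_{state}(x_{t+1}, z_t, \Theta)}{\partial \Theta} \right). \tag{13}$$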
Note that $\nu$ is a vector of independent, random signs and $\rho$ contains k positive numbers. Thus, the rank-one trick 
can be applied for any $\rho$. In UORO, $\rho_0$ and $\rho_1$ are factors meant to control the variance of the estimator’s computed 
approximate derivatives. In practice, we define $\rho_0$ as: 


$$\rho_0 = \sqrt{\frac{\|\tilde{\Theta}_t\|}{\left\| \frac{\partial F_{state}(x_{t+1}, z_t, \Theta)}{\partial z} \tilde{z}_t \right\|}} \tag{14}$$

and $\rho_1$ is defined to be: 

$$\rho_1 = \sqrt{\frac{\left\| (\nu)^T \frac{\partial F_{state}(x_{t+1}, z_t, \Theta)}{\partial \Theta} \right\|}{\|\nu\|}}. \tag{15}$$


Initially, $\tilde{z}_0 = 0$ and $\tilde{\Theta}_0 = 0$, which yields unbiased estimates at time t = 0. Given the construction of the UORO 
procedure, by induction, all subsequent estimates can be shown to be unbiased as well. 
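
The rank-one update and its unbiasedness can be checked numerically on a toy problem. The sketch below is not the NSPDA training code: the Jacobians are random placeholders, the small constant `eps` is an assumption added for numerical safety, and the names `uoro_update` and `mean_over_all_signs` are hypothetical. Averaging over every possible sign vector plays the role of the expectation.

```python
import itertools

import numpy as np


def uoro_update(z_tilde, theta_tilde, dF_dz, dF_dtheta, nu, eps=1e-7):
    """One rank-one update in the form of Equation 13 (eps is an assumption)."""
    a = dF_dz @ z_tilde                   # Jacobian-vector product, shape (n,)
    b = nu @ dF_dtheta                    # vector-Jacobian product, shape (p,)
    rho0 = np.sqrt(np.linalg.norm(theta_tilde) / (np.linalg.norm(a) + eps)) + eps
    rho1 = np.sqrt(np.linalg.norm(b) / (np.linalg.norm(nu) + eps)) + eps
    z_new = rho0 * a + rho1 * nu
    theta_new = theta_tilde / rho0 + b / rho1
    return z_new, theta_new


def mean_over_all_signs(z_tilde, theta_tilde, dF_dz, dF_dtheta):
    """Average the rank-one estimate over all 2^n equally likely sign vectors."""
    n = z_tilde.size
    total = np.zeros((n, theta_tilde.size))
    count = 0
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        z_new, theta_new = uoro_update(z_tilde, theta_tilde, dF_dz,
                                       dF_dtheta, np.array(signs))
        total += np.outer(z_new, theta_new)
        count += 1
    return total / count


rng = np.random.default_rng(0)
n, p = 3, 5
dF_dz = rng.normal(size=(n, n))           # placeholder for dF_state/dz
dF_dtheta = rng.normal(size=(n, p))       # placeholder for dF_state/dTheta
z_tilde = rng.normal(size=n)
theta_tilde = rng.normal(size=p)

# Target: the exact sensitivity recursion applied to the current estimate.
exact = dF_dtheta + dF_dz @ np.outer(z_tilde, theta_tilde)
averaged = mean_over_all_signs(z_tilde, theta_tilde, dF_dz, dF_dtheta)
```

Each individual estimate only stores two vectors, yet the average over sign vectors recovers the full matrix, which is the trade of lower cost for higher variance described above.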